This script describes the features of the data, using different methods and graphs. The description is based on three datasets:
The joined dataset, which contains only information from 2024
The joined dataset which contains information from 2015 to 2024
The WHO TB burden estimates [>1Mb] dataset, as it contains information from previous years
Overall, the document performs some lightweight analysis and data exploration ahead of the more extensive analysis in the two subsequent analysis files.
Loading relevant libraries:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
#Access to the function for loading the datasets and to save them
source("99_proj_func.R")
Loading data:
Joined dataset, year 2024:
#Loading age and sex and risk group data (2024 only) - Joined and augmented version of the data:
data_file <- "03_aug_TB_age_sex.tsv"
TB_age_sex_joined <- load_data(data_file)
Loading ../data/03_aug_TB_age_sex.tsv from local file…
#Display the data:
slice_sample(TB_age_sex_joined, n = 5)
# A tibble: 5 × 18
country year age_group sex risk_factor TB_cases_best TB_cases_min
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Bahrain 2024 15-24 Fema… no risk fa… 1 1
2 Spain 2024 65+ Male no risk fa… 530 170
3 South Africa 2024 15+ Fema… no risk fa… 101000 37000
4 Kyrgyzstan 2024 15+ Both no risk fa… 8500 6300
5 United States of… 2024 5-9 Both no risk fa… 130 67
# ℹ 11 more variables: TB_cases_max <dbl>, population_size <dbl>,
# total_TB_cases_best <dbl>, total_TB_cases_min <dbl>,
# total_TB_cases_max <dbl>, TB_cases_pr_100k_best <dbl>,
# TB_cases_pr_100k_min <dbl>, TB_cases_pr_100k_max <dbl>,
# total_TB_cases_pr_100k_best <dbl>, total_TB_cases_pr_100k_min <dbl>,
# total_TB_cases_pr_100k_max <dbl>
Joined dataset, year 2015 to 2024:
#Loading the 10-year data (2015 to 2024) - Joined version of the data:
data_file <- "03_aug_TB_10_years.tsv"
TB_10_years_joined <- load_data(data_file)
Loading ../data/03_aug_TB_10_years.tsv from local file…
#Display the data:
slice_sample(TB_10_years_joined, n = 5)
#Loading the data dictionary:
data_file <- "01_load_dictionary.tsv"
TB_dictionary <- load_data(data_file)
Loading ../data/01_load_dictionary.tsv from local file…
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
slice_sample(TB_dictionary, n=5)
# A tibble: 5 × 4
variable_name dataset code_list definition
<chr> <chr> <chr> <chr>
1 private_sector_link UNHLM commitments 0=No; 1=yes Have links with the private…
2 lmis Laboratories <NA> Number of sites which by th…
3 budget_cpp_tpt Budget <NA> Average cost of drugs budge…
4 new_sp_fu Notification <NA> New pulmonary smear positiv…
5 xdr_died Outcomes <NA> Outcomes for XDR-TB cases: …
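The parsing warning above suggests calling `problems()` on the loaded data frame. Because `load_data()` is a project-specific helper whose internals are not shown here, the following is only a minimal sketch that uses `readr::read_tsv()` on a small, made-up inline table (the table contents and the names `toy_tsv`, `toy_data` and `parse_issues` are assumptions for illustration) to show what such an inspection could look like:

```r
library(readr)

# Hypothetical inline TSV: "ten" cannot be parsed as a number,
# so reading it produces a parsing issue like the warning above.
toy_tsv <- I("country\tcases\nA\t10\nB\tten\n")

toy_data <- suppressWarnings(
  read_tsv(
    toy_tsv,
    col_types = cols(country = col_character(), cases = col_double())
  )
)

# problems() returns one row per parsing issue, including the row,
# the column, what was expected and what was actually found.
parse_issues <- problems(toy_data)
print(parse_issues)
```

If the object returned by `load_data()` keeps the reader's attributes, `problems(TB_dictionary)` should list the affected rows of the dictionary in the same way.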
Data description:
*TB_age_sex_joined (2024) - Description:
This dataset contains the number of TB cases across different countries, categorized by age group, sex and risk factor. It also includes cases per 100,000 population, enabling standardized and comparable analysis between countries.
slice_sample(TB_age_sex_joined, n=5)
# A tibble: 5 × 18
country year age_group sex risk_factor TB_cases_best TB_cases_min
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 Marshall Islands 2024 25-34 Female no risk fa… 2 0
2 Bahamas 2024 55-64 Male no risk fa… 8 3
3 Kuwait 2024 18+ Both diabetes 71 44
4 Colombia 2024 5-9 Both no risk fa… 170 91
5 Zimbabwe 2024 15-24 Both no risk fa… 4300 1400
# ℹ 11 more variables: TB_cases_max <dbl>, population_size <dbl>,
# total_TB_cases_best <dbl>, total_TB_cases_min <dbl>,
# total_TB_cases_max <dbl>, TB_cases_pr_100k_best <dbl>,
# TB_cases_pr_100k_min <dbl>, TB_cases_pr_100k_max <dbl>,
# total_TB_cases_pr_100k_best <dbl>, total_TB_cases_pr_100k_min <dbl>,
# total_TB_cases_pr_100k_max <dbl>
We can briefly explore how much the picture changes when countries are compared by their total TB cases versus by their TB cases per 100k citizens.
Scatter plot of the 10 countries with the most TB cases in total:
#Getting the 10 countries with highest total amount of TB cases:
top_10_countries <- TB_age_sex_joined |>
  group_by(country) |>
  summarise(total_TB = first(total_TB_cases_best)) |>
  arrange(desc(total_TB)) |>
  slice_head(n = 10) |>
  pull(country)

#Making a tibble for those countries, and plotting them:
plot <- TB_age_sex_joined |>
  filter(country %in% top_10_countries) |>
  group_by(country) |>
  summarise(
    mean_best = sum(TB_cases_best, na.rm = TRUE), #Sum of TB cases for this country
    min_val = sum(TB_cases_min, na.rm = TRUE),
    max_val = sum(TB_cases_max, na.rm = TRUE),
  ) |>
  ggplot(aes(x = mean_best, y = fct_reorder(country, mean_best))) +
  #fct_reorder(country, mean_best) ensures that we sort all countries,
  #based on mean_best value (descending order).
  geom_point(size = 3, color = "orange") +
  geom_errorbarh(aes(xmin = min_val, xmax = max_val), height = 0.2) +
  labs(
    x = "TB cases\n(Best estimate with min/max)",
    y = "Country",
    title = "Top 10 Countries - Total TB cases, with error bars"
  ) +
  theme_minimal()

#Save it
ggsave(
  filename = "../results/04_1_top10_TB.png", #Choose the folder + filename
  plot = plot,
  width = 8, # inches
  height = 5, # inches
  dpi = 300 # high quality
)
plot
Note that very large countries such as China and India are part of this graph.
This is not to say that TB is not a problem in these countries, but rather to highlight that the TB intensity there might not be as high as the totals suggest.
This becomes clear in the following plot.
Scatter plot of the 10 countries with the most TB cases per 100k citizens (standardized):
#Making an object for storing the top 10 countries with most TB cases:
top_10_countries_100k <- TB_age_sex_joined |>
  group_by(country) |>
  summarise(
    total_TB_cases_pr_100k_best = first(total_TB_cases_pr_100k_best),
    #Note: The value is constant for each country, so we just use first()
    #in order to pick the first value (we want to reduce several rows
    #to 1 row pr. country)
  ) |>
  arrange(desc(total_TB_cases_pr_100k_best)) |>
  slice_head(n = 10) |>
  pull(country)

#Making a tibble for those countries, and plotting them:
plot <- TB_age_sex_joined |>
  filter(country %in% top_10_countries_100k) |>
  group_by(country) |>
  summarise(
    mean_best = first(total_TB_cases_pr_100k_best),
    min_val = first(total_TB_cases_pr_100k_min),
    max_val = first(total_TB_cases_pr_100k_max),
  ) |>
  ggplot(aes(x = mean_best, y = fct_reorder(country, mean_best))) +
  #fct_reorder(country, mean_best) ensures that we sort all countries,
  #based on mean_best value (descending order).
  geom_point(size = 3, color = "orange") +
  geom_errorbarh(aes(xmin = min_val, xmax = max_val), height = 0.2) +
  labs(
    x = "TB cases pr. 100k\n(Best estimate with min/max)",
    y = "Country",
    title = "Top 10 Countries - TB cases pr. 100k citizens, with error bars"
  ) +
  theme_minimal()

#Save it
ggsave(
  filename = "../results/04_2_top10_TB_100k.png", #Choose the folder + filename
  plot = plot,
  width = 8, # inches
  height = 5, # inches
  dpi = 300 # high quality
)
plot
Would you look at that!
Except for the Philippines, none of these countries appeared in the plot of the top 10 countries by total TB cases.
This suggests that the standardized TB cases per 100k citizens might be a better measure of a country's TB disease intensity.
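To make the effect of standardization concrete, here is a small worked sketch. The numbers are made up for illustration and are not taken from the WHO data; the only assumption about the real dataset is that a per-100k rate is computed as cases divided by population size, multiplied by 100,000.

```r
library(dplyr)

# Hypothetical toy numbers, NOT taken from the WHO data.
toy_cases <- tibble(
  country         = c("Large country", "Small country"),
  total_TB_cases  = c(500000, 20000),
  population_size = c(1000000000, 5000000)
)

# Rate per 100k = cases / population * 100,000
toy_rates <- toy_cases |>
  mutate(TB_cases_pr_100k = total_TB_cases / population_size * 1e5) |>
  arrange(desc(TB_cases_pr_100k))

print(toy_rates)
```

In this toy example the large country has 25 times as many cases in total, yet the small country ends up first once the counts are expressed per 100k citizens (400 versus 50).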
*TB_10_years_joined (2015-2024) - Description:
We combined three WHO datasets — TB burden estimates, MDR/RR-TB burden estimates, and TB infection in household contacts — into a single multi-country, multi-year panel covering approximately 10 years.
The merged dataset contains measures of TB incidence and mortality, MDR/RR-TB incidence, and estimated household infection rates for each country–year.
This integrated dataset allows us to describe global TB trends, compare drug-resistant and drug-sensitive TB, and evaluate household transmission indicators.
(Note: Multidrug-resistant tuberculosis (MDR-TB) is defined as disease due to Mycobacterium tuberculosis that is resistant to isoniazid (H) and rifampicin (R), with or without resistance to other drugs. RR-TB stands for rifampicin-resistant TB.)
# A tibble: 43 × 2
variable_name definition
<chr> <chr>
1 country Country or territory name
2 rr_new Number of new bacteriologically confirmed pulmonary TB patient…
3 c_newinc_100k Case notification rate, which is the total of new and relapse …
4 cfr Estimated TB case fatality ratio
5 cfr_hi Estimated TB case fatality ratio: high bound
6 cfr_lo Estimated TB case fatality ratio: low bound
7 cfr_pct Estimated TB case fatality ratio expressed as a percentage
8 cfr_pct_hi Estimated TB case fatality ratio: high bound expressed as a pe…
9 cfr_pct_lo Estimated TB case fatality ratio: low bound expressed as a per…
10 e_inc_100k Estimated incidence (all forms) per 100 000 population
# ℹ 33 more rows
After considering the definitions, we select only the variables that hold the most descriptive information, in order to get to know the data. We select the per-100k variables to get the most comparable summary.
Warning: There was 1 warning in `reframe()`.
ℹ In argument: `across(where(is.numeric), list(mean = mean, sd = sd), na.rm =
TRUE)`.
Caused by warning:
! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
Supply arguments directly to `.fns` through an anonymous function instead.
# Previously
across(a:b, mean, na.rm = TRUE)
# Now
across(a:b, \(x) mean(x, na.rm = TRUE))
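The deprecation warning above can be avoided by passing `na.rm = TRUE` inside anonymous functions, as the message itself suggests. The code that produced the summary is not shown in this document, so the following is only a sketch on a made-up toy panel: the values are hypothetical, and only the column names `e_inc_100k` and `c_newinc_100k` are borrowed from the dictionary.

```r
library(dplyr)

# Hypothetical toy panel; the values are made up for illustration.
toy_panel <- tibble(
  country       = c("A", "A", "B", "B"),
  year          = c(2015, 2016, 2015, 2016),
  e_inc_100k    = c(100, 120, 10, NA),
  c_newinc_100k = c(80, 90, 8, 9)
)

toy_summary <- toy_panel |>
  # Keep only the per-100k variables
  select(ends_with("_100k")) |>
  reframe(
    across(
      where(is.numeric),
      # Anonymous functions replace the deprecated `...` argument
      list(
        mean = \(x) mean(x, na.rm = TRUE),
        sd   = \(x) sd(x, na.rm = TRUE)
      )
    )
  )

print(toy_summary)
```

The result has one row, with one `_mean` and one `_sd` column per selected variable.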
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
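The second warning refers to line geoms: since ggplot2 3.4.0, line thickness is set with `linewidth` rather than `size`. The plotting code itself is not shown here, so this is a minimal sketch with a hypothetical data frame (`toy_trend` and its values are made up) showing the updated argument:

```r
library(ggplot2)

# Hypothetical toy series, only for illustrating the argument name.
toy_trend <- data.frame(
  year  = 2015:2024,
  value = c(5, 6, 7, 8, 9, 4, 6, 8, 9, 10)
)

trend_plot <- ggplot(toy_trend, aes(x = year, y = value)) +
  geom_line(linewidth = 1, color = "orange") + # linewidth instead of size
  geom_point(size = 2) +                       # size is still valid for points
  theme_minimal()

trend_plot
```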
Interestingly, household contacts appear to have dropped sharply during the COVID-19 quarantine (around April 2020), only to increase rapidly afterwards.